Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply

Authors

Rajesh Nishtala

Richard W. Vuduc

James W. Demmel

Katherine A. Yelick

Abstract

We consider the problem of building high-performance implementations of sparse matrix-vector multiply (SpM×V), or y = y + A·x, which is an important and ubiquitous computational kernel. Prior work indicates that cache blocking of SpM×V is extremely important for some matrix and machine combinations, with speedups as high as 3x. In this paper we present a new, more compact data structure for cache blocking for SpM×V and look at the general question of when and why performance improves. Cache blocking appears to be most effective when, simultaneously, (1) the vector x does not fit in cache, (2) the vector y fits in cache, (3) the nonzeros are distributed throughout the matrix, and (4) the nonzero density is sufficiently high. In particular, we find that cache blocking does not help with band matrices, no matter how large x and y are, since the matrix structure already lends itself to the optimal access pattern. Prior work on performance modeling assumed that the matrices were small enough that x and y fit in the cache. When this is not the case, however, the optimal block sizes picked by these models may perform poorly, which motivates us to update these performance models. In contrast, the optimum block sizes predicted by the new performance models generally match the measured optimum block sizes, and therefore the models can be used as a basis for a heuristic to pick the block size. We conclude with architectural suggestions that would make processor and memory systems more amenable to SpM×V.
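To make the idea of cache blocking concrete, the following is a minimal Python sketch, not the compact data structure the paper proposes. It assumes the matrix is given as (row, column, value) triples and splits it into rectangular blocks, each stored as its own small CSR submatrix, so that one block's multiply touches only a bounded slice of x and y. The function names (`to_csr`, `spmv_csr`, `cache_block`, `spmv_cache_blocked`) and the block-size parameters `r_block` and `c_block` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[int, int, float]           # (row, col, value)
CSR = Tuple[List[int], List[int], List[float]]  # (row_ptr, col_idx, vals)


def to_csr(n_rows: int, entries: List[Triple]) -> CSR:
    """Build CSR arrays from (row, col, value) triples."""
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in entries:
        row_ptr[r + 1] += 1               # count nonzeros per row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]      # prefix sum gives row offsets
    col_idx = [0] * len(entries)
    vals = [0.0] * len(entries)
    next_slot = row_ptr[:-1].copy()       # next free position in each row
    for r, c, v in entries:
        k = next_slot[r]
        col_idx[k] = c
        vals[k] = v
        next_slot[r] += 1
    return row_ptr, col_idx, vals


def spmv_csr(row_ptr, col_idx, vals, x, y, row0=0, col0=0):
    """Compute y += A*x for a CSR (sub)matrix whose top-left corner
    sits at (row0, col0) of the full matrix."""
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col0 + col_idx[k]]
        y[row0 + i] += acc


def cache_block(n_rows: int, entries: List[Triple],
                r_block: int, c_block: int):
    """Split the matrix into r_block x c_block cache blocks.
    Returns a list of (row0, col0, csr) with block-local indices;
    empty blocks are not stored."""
    buckets: Dict[Tuple[int, int], List[Triple]] = {}
    for r, c, v in entries:
        key = (r // r_block, c // c_block)
        buckets.setdefault(key, []).append((r % r_block, c % c_block, v))
    blocks = []
    for (bi, bj), ents in sorted(buckets.items()):
        rows_here = min(r_block, n_rows - bi * r_block)  # last block may be short
        blocks.append((bi * r_block, bj * c_block, to_csr(rows_here, ents)))
    return blocks


def spmv_cache_blocked(blocks, x, y):
    """Compute y += A*x one cache block at a time, so each block reads
    at most c_block entries of x and updates at most r_block entries of y."""
    for row0, col0, (row_ptr, col_idx, vals) in blocks:
        spmv_csr(row_ptr, col_idx, vals, x, y, row0, col0)
```

In this sketch the block sizes are free parameters. Choosing them well is the question the abstract's performance models address, and nothing here attempts to reproduce those models.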

Related resources

Run-Time Reference Clustering for Cache Performance Optimization

We introduce a method for improving the cache performance of irregular computations in which data are referenced through run-time defined indirection arrays. Such computations often arise in scientific problems. The presented method, called Run-Time Reference Clustering (RTRC), is a run-time analog of a compile-time blocking used for dense matrix problems. RTRC uses the data partitioning and re...

Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure

We improve the performance of sparse matrix-vector multiply (SpMV) on modern cache-based superscalar machines when the matrix structure consists of multiple, irregularly aligned rectangular blocks. Matrices from finite element modeling applications often have this kind of structure. Our technique splits the matrix, A, into a sum, A1 + A2 + ... + As, where each term is stored in a new data str...

An Improved Sparse Matrix-Vector Multiply Based on Recursive Sparse Blocks Layout

Recursive Sparse Blocks (RSB) is a sparse matrix layout designed for coarse-grained parallelism and reduced cache misses when operating on matrices that are larger than a computer’s cache. By laying out the matrix in sparse, non-overlapping blocks, we allow for the shared-memory parallel execution of transposed Sparse Matrix-Vector multiply (SpMV), with higher efficiency than the tradi...

Optimizing Sparse Matrix Vector Multiplication on SMPs

We describe optimizations of sparse matrix-vector multiplication on uniprocessors and SMPs. The optimization techniques include register blocking, cache blocking, and matrix reordering. We focus on optimizations that improve performance on SMPs, in particular, matrix reordering implemented using two different graph algorithms. We present a performance study of this algorithmic kernel, showing ho...

Influence of Cross-interferences on Blocked Loops: a Case Study with Matrix-vector Multiply

State-of-the-art data locality optimizing algorithms are targeted for local memories rather than for cache memories. Recent work on cache interferences seems to indicate that these phenomena can severely affect the cache performance of blocked algorithms. Because of cache conflicts, it is not possible to know the precise gain brought by blocking. It is even difficult to determine for which problem sizes b...

Publication date: 2004
